Automated Cleanup Processing for Extracting Bibliographic Data from Biomedical Online Journals
Abstract
An R&D division of the National Library of Medicine (NLM) has developed the Web-based Medical Article Records System (WebMARS) to create citations from online biomedical journals. This paper presents one important part of this system, the automated cleanup module that extracts bibliographic information from HTML-formatted text based on a rule-based approach. A learning scheme comparing the output of the cleanup module to the verified processing result is newly introduced to create and update cleanup rules automatically, thereby minimizing the manual effort for rule setting and improving the performance of the cleanup processing. Experimental results show that the proposed automated cleanup module can effectively detect and extract the bibliographic data of interest from HTML-formatted online journal articles using relevant rules identified through the learning process.
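The abstract does not give the rule format or the learning procedure, so the following is only a minimal sketch of the general idea: apply cleanup rules to HTML-formatted text, then derive a new rule by comparing the cleanup output with a verified result. The function names `cleanup` and `learn_prefix_rule`, the representation of a rule as a (pattern, replacement) pair, and the sample HTML are all hypothetical and are not taken from WebMARS.

```python
import re
from html import unescape


def cleanup(raw, rules):
    """Strip HTML markup from `raw`, then apply (pattern, replacement) rules in order."""
    text = re.sub(r"<[^>]+>", " ", raw)        # replace tags with spaces
    text = unescape(text)                      # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip()


def learn_prefix_rule(cleaned, verified):
    """Hypothetical learning step: if the verified text equals the cleanup
    output minus a leading label, return a rule that removes that label.
    Returns None when no such rule can be derived."""
    if cleaned != verified and cleaned.endswith(verified):
        prefix = cleaned[: len(cleaned) - len(verified)]
        return (re.compile("^" + re.escape(prefix)), "")
    return None


# Illustrative usage with made-up input.
rules = []
raw = '<p class="abs"><b>Abstract:</b> Rule-based cleanup &amp; learning.</p>'
verified = "Rule-based cleanup & learning."

cleaned = cleanup(raw, rules)
new_rule = learn_prefix_rule(cleaned, verified)
if new_rule is not None:
    rules.append(new_rule)   # the learned rule is reused for later articles
```

Once the learned rule is in `rules`, later articles that carry the same leading label are cleaned without manual rule setting, which is the effect the abstract attributes to the learning scheme. A real system would need more rule types than this single prefix case.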
Similar resources
Automated Labeling Of Biomedical Online Journal Articles
An automated labeling (AL) module has been developed to automate the extraction of bibliographic data (e.g., article title, authors, affiliation, abstract, and others) from online biomedical journals for the National Library of Medicine’s MEDLINE database. The AL module employs string matching, statistics, and fuzzy rule-based algorithms to identify segmented zones in an article’s HTML pages a...
Automated labeling of bibliographic data extracted from biomedical online journals
A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, abstract, affiliation and others) from online biomedical journals to populate the National Library of Medicine’s MEDLINE® database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented ...
Automated Labeling from Biomedical Journals published in Foreign Languages
An automated labeling (AL) module is developed to produce bibliographic records such as English title, vernacular title, author, affiliation, and English abstract from biomedical articles published in foreign language journals. Optical character recognition (OCR) output from scanned biomedical journals is used in this labeling process. Since frequently occurring words in a zone are important fe...
Automated Labeling Algorithms for Biomedical Document Images
The National Library of Medicine (NLM) has developed an automated system, named Medical Article Records System (MARS), to process bibliographic data (title, authors, affiliation, abstract, etc.) in biomedical journal articles for its MEDLINE database. This paper describes a labeling module in MARS that automatically extracts the bibliographic data from biomedical journal articles. The label...
Automated data entry system: performance issues
This paper discusses the performance of a system for extracting bibliographic fields from scanned pages in biomedical journals to populate MEDLINE®, the flagship database of the National Library of Medicine (NLM), which is heavily used worldwide. This system consists of automated processes to extract the article title, author names, affiliations and abstract, and manual workstations for the entry of...
Publication date: 2005